{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 通用embedding模型微调\n",
    "在RAG链路中，往往会遇到一些检索问题，例如在垂直领域中，很多简单的问题却回答不对，这种情况大概是因为召回阶段效果不佳。当通用embedding模型在你的数据上面表现不佳时，你需要考虑微调模型，具体地，以下是整个流程的具体步骤。\n",
    "\n",
    "- 对文档处理，得到切片后的纯文本\n",
    "- 撰写promp，通过大模型得到合成query\n",
    "- 构造数据集，得到正负例子数据\n",
    "- 微调模型，评估效果"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.文档处理\n",
    "文档处理涉及到怎么把结构化的文档转换成纯文本和切片的过程。目前市面上有很多文档处理的工具，本质上有两种方案，一种是通过解析工具提取到文档中内嵌的文本，另一种则是通过OCR的方式进行转换。两种方法都需要取舍，取决于你的场景，解析的方法往往很难去掉干扰信息，例如水印、页眉页脚，而OCR的方式识别的也不尽然全部正确，例如对于0和O这种存在一定的错误可能。\n",
    "这里推荐使用的解析工具是：https://github.com/VikParuchuri/marker，\n",
    "对于常见的文档效果基本够用了。切片的话，可以使用llamaindex，或者简单的正则进行切片。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 合成query\n",
    "假设已经得到了切分好的文档，并且已经按照规范的格式整理成在一个文件中，以下是格式范例：\n",
    "``` json\n",
    "{\"content\": \"这是文档1\"}\n",
    "{\"content\": \"这是文档2\"}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json, jsonlines\n",
    "from ultrarag.modules.llm import OpenaiLLM\n",
    "\n",
    "src_file_path = \"你的切片数据\"\n",
    "dst_file_path = \"你的合成数据输出路径\"\n",
    "\n",
    "with open(\"data_synth.pmt\", \"r\", encoding=\"utf8\") as fr:\n",
    "    propmts = fr.read()\n",
    "\n",
    "with jsonlines.open(src_file_path, \"r\") as fr:\n",
    "    corpus = list(fr)\n",
    "\n",
    "exector = OpenaiLLM(api_key=\"\", base_url=\"\", model=\"\")\n",
    "\n",
    "query_pos_list = []\n",
    "for item in corpus:\n",
    "    content = item['content']\n",
    "    messages = [dict(role=\"user\", content=propmts.replace(\"{content}\", content))]\n",
    "\n",
    "    MAX_RETRIES = 3\n",
    "    for _ in range(MAX_RETRIES):\n",
    "        try:\n",
    "            resp = await exector.arun(messages=messages, stream=False)\n",
    "            resp = resp.strip(\"``` json\")\n",
    "            resp = json.loads(resp)\n",
    "            query = [q[\"query\"] for q in resp]\n",
    "            break\n",
    "        except:\n",
    "            pass\n",
    "    query_pos_list.extend([dict(query=q, content=content) for q in query])\n",
    "\n",
    "with jsonlines.open(dst_file_path, \"w\") as fw:\n",
    "    for item in query_pos_list:\n",
    "        fw.write(item)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 构造数据集\n",
    "构造数据集的过程主要是造负例的过程，这里提供一个脚本，名叫negs_build.py, 它通过构造一个大的向量索引的方式，召回一部分文档，并把得分较低的部分拿出来作为负例，你可以选择负例的数目。具体地，这个脚本可以按照以下说明使用。\n",
    "\n",
    "``` bash\n",
    "python ultrarag/datasets/embedding/negs_build.py  \\\n",
    "    -m 'bgem3 embedding模型路径' \\\n",
    "    -q '合成query文件路径' \\\n",
    "    -c '切片文件路径,用于构造索引' \\\n",
    "    -s '训练数据输出路径'\n",
    "```\n",
    "以上代码需要运行在GPU环境下，否则整个流程会很慢。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 模型训练\n",
    "模型训练使用bgem3提供的训练脚本，以下是运行的命令的shell脚本。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 评估效果\n",
    "评估效果的话可以使用脚本recall_test.py进行，这里给出运行脚本的命令。\n",
    "\n",
    "``` bash\n",
    "python ultrarag/evaluate/retrieval_evaluate.py \\\n",
    "    -m \"你的模型路径\" \\\n",
    "    -q \"你的测试集路径[或者从合成的query中筛选一部分作为测试集]\" \\\n",
    "    -c \"切片文件作为知识库索引\" \\\n",
    "    -t \"召回topn参数\"  \\\n",
    "    -s \"召回结果\" \\\n",
    "\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ultrarag",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
